Biostatistics 1BI043

Karl Oskar Ekvall
Fall 2020

Organization

Content

Concepts and tools to understand scientific literature and perform statistical analyses.

  • Descriptive and exploratory statistics

  • Probability

  • Estimation and inference

Suggestion

We will cover a lot of material in a short period of time. To keep up:

  • Make good use of the lectures by asking questions

  • Know that you are not expected to understand everything at once

  • Go through slides (koekvall.github.io/biostat.html) after class and make sure everything is clear

Today

  • Random variables and realizations

  • Understanding data using descriptive statistics and plots

  • Introduction to software (R and RStudio)





Random variables, realizations, and data

Randomness and sampling

Data are often outcomes of random experiments or sampling.

If we repeat an experiment, we usually get different data.

Example

Five patients are selected at random to receive a new drug.

The effectiveness of the drug typically depends on who is in the treatment group.

Random variables

A random variable is an (often numerical) measurement of the outcome of an experiment yet to be performed.

Example

Let \( X \) be the number of minutes before 9 am that the first student joins Zoom tomorrow.

Non-example

The number of minutes before 9 am that the first student joined Zoom today.

Realizations

A realization or an observation is the particular value a random variable took when the experiment was performed.

It is common to use capital letters for random variables (\( X \)) and lower case letters for realizations (\( x \)).

Example

If tomorrow it turns out that the first student joins Zoom at 8:55 am, the realized or observed value of \( X \) is \( x = 5 \).

Example

The first student joined Zoom \( y \) minutes before 9 am today.

Data

We typically assume data are realizations of random variables.

Example

Select 10 students in the class at random and measure their heights in centimeters.

Let \( X_i \) denote the height of the \( i \)th randomly selected person, \( i = 1, \dots, 10 \).

After having performed the experiment, our data, or sample, may consist of the realizations

\[ \{x_1, x_2, \dots, x_{10}\} = \{165, 181, \dots, 169\}. \]





Descriptive statistics

Descriptive statistics

Even a moderately large dataset is difficult to understand by just looking at it.

Descriptive statistics can help.

Definition: A statistic is a function of the data; in other words, something you can compute if you are given the data.

A descriptive statistic is a statistic intended to tell you something useful about the data.

Descriptive statistics

Example

The sample mean or sample average is

\[ \bar{x} = \frac{1}{n}\sum_{i = 1}^n x_i = \frac{1}{n}(x_1 + \cdots + x_n). \]

The sample mean is a statistic because you can compute it if I tell you what \( x_1, \dots, x_n \) are.

Descriptive statistics

More generally, we will consider descriptive statistics that quantify:

  • Central tendency / location (e.g. sample mean)

  • Dispersion / variability

  • Asymmetry

  • Association

Measures of central tendency and location

In R, you can calculate the sample mean of any vector of observations easily:

heights <- c(165, 181, 177, 189, 185, 155, 170, 179, 172, 169)
mean(heights)
[1] 174.2

Measures of central tendency and location

Median

The middle number of the sorted data if \( n \) is odd, and the average of the two middle numbers if \( n \) is even.

sort(heights)
 [1] 155 165 169 170 172 177 179 181 185 189
median(heights)
[1] 174.5

Measures of central tendency and location

According to Credit Suisse's global wealth report:

Average wealth of an adult in Sweden in 2019 was 256,000 USD.

Median was 42,000 USD.

Depending on the situation, one or the other may be a more useful measure of location.

Measures of central tendency and location

The mean is sensitive to outliers, the median is not.

income <- c(50, 30, 60, 55, 75, 300) # 1000 USD / year
mean(income)
[1] 95
median(income)
[1] 57.5

Other location measures

Sample quantiles

Quartiles partition the sample into four equally sized subsets

  sort(heights)
 [1] 155 165 169 170 172 177 179 181 185 189
  quantile(heights, c(0.25, 0.5, 0.75))
   25%    50%    75% 
169.25 174.50 180.50 

Other location measures

Sample percentiles

Percentiles partition the sample into 100 equally sized subsets

  quantile(heights, 0.1) # 10th percentile
10% 
164 

Measures of dispersion

Sample variance and standard deviation

\[ s^2 = \frac{1}{n - 1}\sum_{i = 1}^n(x_i - \bar{x})^2, \quad \text{and}\quad s = \sqrt{s^2} \]

  • Loosely speaking, the sample standard deviation tells you how much a typical observation differs from the sample mean.

However, in general, it does not equal the mean absolute deviation:

\[ s \neq \frac{1}{n}\sum_{i = 1}^n \vert x_i - \bar{x}\vert. \]
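To make the difference concrete, here is the comparison for the heights data, sketched in Python (the course software is R, but the arithmetic is identical):

```python
# Heights data from the example above
heights = [165, 181, 177, 189, 185, 155, 170, 179, 172, 169]
n = len(heights)
xbar = sum(heights) / n                                 # sample mean: 174.2

# Sample standard deviation (sum of squared deviations, divided by n - 1)
s = (sum((x - xbar) ** 2 for x in heights) / (n - 1)) ** 0.5

# Mean absolute deviation (divided by n)
mad = sum(abs(x - xbar) for x in heights) / n

print(round(s, 2))    # 10.09
print(round(mad, 2))  # 8.0
```

Here \( s \approx 10.09 \) while the mean absolute deviation is \( 8.0 \): close, but not equal.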

Measures of dispersion

Example

  heights - mean(heights)
 [1]  -9.2   6.8   2.8  14.8  10.8 -19.2  -4.2   4.8  -2.2  -5.2
  (heights - mean(heights))^2
 [1]  84.64  46.24   7.84 219.04 116.64 368.64  17.64  23.04   4.84  27.04
  sum((heights - mean(heights))^2) / 9 # Sample var
[1] 101.7333
  sqrt(sum((heights - mean(heights))^2) / 9) # Sample sd
[1] 10.08629

Measures of dispersion

Ranges

The range is \( \max_i x_i - \min_i x_i \) and the inter-quartile range (IQR) is the difference between the third and first quartile.

max(heights) - min(heights) # Range
[1] 34
IQR(heights)
[1] 11.25

Plotting summary statistics

To practice, we can make the following plot:

plot(heights)
abline(h = mean(heights), lwd = 2)
abline(h = median(heights), lty = 2, lwd = 2)
abline(h = quantile(heights, 0.25), lty = 2, lwd = 2)
abline(h = quantile(heights, 0.75), lty = 2, lwd = 2)
abline(h = mean(heights) + sd(heights), lty = 3, lwd = 2)
abline(h = mean(heights) - sd(heights), lty = 3, lwd = 2)


Summary statistics for two variables

Suppose our data are a sample of \( n \) pairs:

\[ \{(x_1, y_1), \dots, (x_n, y_n)\}. \]

As before, we can summarize properties of the \( y_i \) and \( x_i \), e.g.

\[ \bar{x} = \frac{1}{n}\sum_{i = 1}^nx_i, ~~ \bar{y} = \frac{1}{n}\sum_{i = 1}^ny_i, ~~ s^2_x = \frac{1}{n - 1}\sum_{i = 1}^n(x_i - \bar{x})^2, ~~ s_y^2 = \frac{1}{n - 1}\sum_{i = 1}^n(y_i - \bar{y})^2 \]

But how can we quantify the association between them?

Covariance and correlation

Sample covariance \[ s_{xy} = \frac{1}{n - 1}\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y}). \] Intuition: Two variables have positive covariance if they tend to be larger (or smaller) than their respective means at the same time.

Notice \( s_{xx} = s_x^2 \) and \( s_{xy} = s_{yx} \).

Sample correlation \[ \rho_{xy} = \frac{s_{xy}}{s_x s_y}, \]

  • unit-free measure of association
  • same sign as the sample covariance
  • always between -1 and 1
  • symmetric in \( y \) and \( x \): \( \rho_{xy} = \rho_{yx} \)

Association

Example \[ \text{weights} = \{y_1, \dots, y_{10}\} = \{69, \dots, 67 \} \]

weights <- c(69, 78, 74, 90, 80, 61, 76, 77, 69, 67)
plot(heights, weights)
abline(v = mean(heights), lwd = 2)
abline(h = mean(weights), lwd = 2)


Sample covariance

Example

The plot indicates a positive relationship between the variables

Their sample covariance is

  weights - mean(weights)
 [1]  -5.1   3.9  -0.1  15.9   5.9 -13.1   1.9   2.9  -5.1  -7.1
  heights - mean(heights)
 [1]  -9.2   6.8   2.8  14.8  10.8 -19.2  -4.2   4.8  -2.2  -5.2
  sum((weights - mean(weights)) *  (heights - mean(heights))) / 9
[1] 75.31111

Sample correlation

Example

cov(heights, weights) / (sd(heights) * sd(weights))
[1] 0.9230557
cor(weights, heights)
[1] 0.9230557

Since we know \( -1\leq \rho_{xy} \leq 1 \), the value \( \rho_{xy} = 0.92 \) indicates a strong, positive relationship between height and weight.

Cautions

Descriptive statistics are just that—descriptions of your sample.

  • we are often interested in characteristics of an underlying population, not a particular sample
  • we are often (but not always) interested in causal mechanisms
    • if I gain weight, will I become taller?
  • there are many potentially interesting relationships between variables not captured by correlation
    • correlation is a measure of linear dependence

Anscombe's quartet

Four different datasets, each with \( n = 11 \) observations of 2 variables

data(anscombe)
round(apply(anscombe, 2, mean), 2)
 x1  x2  x3  x4  y1  y2  y3  y4 
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5 
round(apply(anscombe, 2, sd), 2)
  x1   x2   x3   x4   y1   y2   y3   y4 
3.32 3.32 3.32 3.32 2.03 2.03 2.03 2.03 
round(c(cor(anscombe$x1, anscombe$y1),
        cor(anscombe$x2, anscombe$y2),
        cor(anscombe$x3, anscombe$y3),
        cor(anscombe$x4, anscombe$y4)), 2)
[1] 0.82 0.82 0.82 0.82

Anscombe's quartet

[Figure: scatter plots of the four Anscombe datasets]

Anscombe's quartet

Lessons:

  • we should plot the data if possible
  • the sample correlation is sometimes an appropriate measure of association (dataset 1) and sometimes not (dataset 4)
  • the sample mean is sometimes an appropriate measure of central tendency (dataset 1) and sometimes not (dataset 4)

Imagine someone told you the sample correlation between the dose of a drug and the number of days until a patient is symptom free is -0.82.

  • you may (obviously) want to do some further investigation before concluding the drug is effective

In practice, correlation is often useful but not the whole story.

More on plotting

Let's consider bill lengths and depths in the penguins data.

The data are available at https://github.com/allisonhorst/palmerpenguins; artwork by @allison_horst.

# install.packages("palmerpenguins")
library(palmerpenguins)

More on plotting

hist(penguins$bill_length_mm); hist(penguins$bill_depth_mm)


More on plotting

[Figure: bill lengths and depths by species]

How do the species differ with respect to length and depth?

More on plotting

Is there a correlation between length and depth?

plot(penguins$bill_length_mm, penguins$bill_depth_mm)


Can you see three clusters, corresponding to species, in this plot?

More on plotting

plot(penguins$bill_length_mm, penguins$bill_depth_mm, col = penguins$species, pch = c(16, 17, 18)[penguins$species])


You can certainly see the clusters in this plot!

For penguins of the same species, is there a correlation between bill length and depth?

Simpson's paradox

(This code uses \( \mathtt{dplyr} \); you do not need to learn it)

library(dplyr)
penguins %>% summarize(r = cor(bill_length_mm, bill_depth_mm, use = "complete"))
# A tibble: 1 x 1
       r
   <dbl>
1 -0.235
penguins %>% group_by(species) %>% summarize(r = cor(bill_length_mm, bill_depth_mm, use = "complete"))
# A tibble: 3 x 2
  species       r
  <fct>     <dbl>
1 Adelie    0.391
2 Chinstrap 0.654
3 Gentoo    0.643

This reversal of the sign of an association after grouping is known as Simpson's paradox.

Some final remarks on summary statistics and plots

str(penguins, vec.len = 1)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 ...
 $ flipper_length_mm: int [1:344] 181 186 ...
 $ body_mass_g      : int [1:344] 3750 3800 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 ...
 $ year             : int [1:344] 2007 2007 ...

Some final remarks on summary statistics and plots

  • A factor can be nominal or ordinal
    • Nominal: there is no natural ordering (species)
    • Ordinal: there is a natural ordering (low / high body mass)
  • A binary variable takes only two values
  • A numerical variable takes numbers with the usual meaning (so addition, subtraction, multiplication, etc. makes sense)
    • An integer is a special case of a numerical variable but may obey different rules in a computer

Averages, correlations, and similar measures are mostly meaningful for numerical variables.





Probability

Content

  • Events

  • Rules of probabilities

  • Conditional probability and Bayes' theorem

Probability

To do more than descriptive and exploratory statistics, we need to consider how the data may have been generated.

Probability theory lets us use statistics to reason about what a particular sample may say about an underlying population.

Sometimes there is an actual population we are sampling from, sometimes it is a theoretical construct.

Disclaimer

Everything we will cover can be motivated formally using mathematics.

Because this is not a course in mathematics, we will state many things without formal motivation.

Remember to ask if you find anything confusing!

Probability

Events

Things to which probabilities can be assigned are called events.

An event is a set, or collection, of (potential) outcomes.

For example:

  • “it rains at 5 pm tomorrow” is an event (or at least reasonably modeled as such)
  • If \( X \) is the result of rolling a six-sided die, \( X \geq 2 \) is an event consisting of the outcomes \( X = 2, \dots, X = 6 \), each of which is also an event.
  • “I am 183 cm tall” is not an event, but if we select a person at random, then “they are 183 cm tall” is an event.

Probability

We often use letters such as \( A \) and \( B \) to denote events.

At the end of this section, we will answer the following question:

Let \( A \) be the event that a randomly sampled person is a cannabis user.

Let \( B \) be the event that a randomly sampled person tests positive for cannabis.

Suppose the test correctly identifies users 90 % of the time, correctly identifies non-users 95 % of the time, and that 5 % of the population are cannabis users.

What is the probability that a randomly selected person is a cannabis user given that they test positive?

Probability

We write the probability of the event \( A \) as \( P(A) \).

For example, it may be that \( P(X = 5) = 1/2 \) or \( P(\text{it rains tomorrow at 5 pm}) = 0.1 \).

We have the following:

  1. For any event \( A \), \( 0\leq P(A) \leq 1 \).
  2. If two events \( A \) and \( B \) cannot happen at the same time (they are disjoint), then the probability that (at least) one of them happens is \( P(A) + P(B) \).
  3. If \( A \) contains \( B \); that is, \( A \) happens whenever \( B \) happens, then \( P(A) \geq P(B) \).
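A brute-force check of these rules for a fair six-sided die (a Python sketch; the particular events are chosen only for illustration):

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]            # sample space of one die roll

def prob(event):
    """Probability of an event, given as a set of equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

A = {2, 4, 6}      # "roll is even"
B = {1}            # "roll is 1" -- disjoint from A
C = {2}            # C is a subset of A

# Rule 1: probabilities are between 0 and 1
assert 0 <= prob(A) <= 1
# Rule 2: for disjoint events, P(A or B) = P(A) + P(B)
assert prob(A | B) == prob(A) + prob(B)   # 4/6 = 3/6 + 1/6
# Rule 3: if A contains B, then P(A) >= P(B)
assert prob(A) >= prob(C)
```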

Probability

We can define new events, for example \( C \) can be defined to be “\( A \) or \( B \)”; that is, \( C \) happens if either \( A \) happens, \( B \) happens, or both \( A \) and \( B \) happen.

We often write \( A\cup B \); you can read this as \( A \) union \( B \).

Example

Let \( A \) be the event that it rains tomorrow at 5 pm and let \( B \) be the event that I am late for class tomorrow. If \( C \) is defined as “\( A \) or \( B \)”, or \( C = A\cup B \), then \( C \) is the event that either it rains tomorrow at 5 pm, or I am late to class tomorrow, or both.

Probability

We can also define \( D \) to be “\( A \) and \( B \)”; that is, \( D \) happens if and only if both \( A \) and \( B \) happen.

We often write \( A\cap B \), which is called the intersection of \( A \) and \( B \).

Example

If \( A \) is the event that it rains tomorrow at 5 pm and \( B \) the event that I am late for class tomorrow, and if \( D \) is “\( A \) and \( B \)”, or \( D = A\cap B \), then \( D \) is the event that it both rains tomorrow at 5 pm and I am late for class. In particular, \( D \) does not happen if only one of \( A \) or \( B \) happens.

Exercise

Show (that is, use the stated facts about probabilities to argue) that \( P(C) \geq P(D) \).

Probability

The complement of \( A \) is the event “not \( A \)”.

We often write this as \( A^c \).

The probability of \( A^c \) is always \( 1 - P(A) \).

Motivation

Either \( A \) happens or it doesn't, so \( P(A\cup A^c) = 1 \).

Because \( A \) and \( A^c \) cannot happen at the same time, one of the rules of probabilities says

\[ 1 = P(A \cup A^c) = P(A) + P(A^c). \]

Probability

This is a Venn diagram. The rectangle is called the sample space. Events are subsets of the sample space.

You can think of probabilities as sizes of subsets. The “size” of the sample space is always 1.

Probability

Find \( A\cup B \), \( A\cap B \), \( A\cap C \), \( A\cup C \), \( A\cup B \cup C \), and \( (A\cap B) \cup (B\cap C) \)

Probability

\( A\cup B \)

Probability

\( A\cap B \)

Probability

\( A\cap C \)

Probability

\( A\cup C \)

Probability

\( A\cup B\cup C \)

Probability

\( (A\cap B) \cup (B\cap C) \)

Probability

By using Venn diagrams, we can convince ourselves that

\[ P(A \cup B) = P(A) + P(B) - P(A\cap B) \]

Finally, two events \( A \) and \( B \) are called independent if

\[ P(A\cap B) = P(A)P(B). \]

Example

If I roll a regular six-sided die twice, what is the probability that the first roll is 2 and the second is 4? You may assume the rolls are independent.

Probability

Answer: Let \( X_1 \) be the result of the first roll and \( X_2 \) that of the second. Then, by independence, \( P(X_1 = 2 \cap X_2 = 4) = P(X_1 = 2)P(X_2 = 4) = (1/6)(1/6) = 1/36 \).

Probability

Example

If I roll a regular six-sided die twice, what is the probability that one of the rolls is 2 and the other is 4? You may assume the rolls are independent.

Probability

Answer: Let \( X_1 \) be the result of the first roll and \( X_2 \) that of the second. First, let's figure out which outcomes are in our event. One outcome is that the first roll is 2 and the second is 4. Another is that the first is 4 and the second is 2. There are no other outcomes in our event.

Thus, we are looking for

\[ P[(X_1 = 2 \cap X_2 = 4)\cup (X_1 = 4 \cap X_2 = 2)]. \]

Because the events in the union are disjoint and the rolls are independent, this is equal to

\[ P(X_1 = 2 \cap X_2 = 4) + P(X_1 = 4 \cap X_2 = 2) = 2/36. \]
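Both die-roll examples can be verified by enumerating all 36 equally likely outcomes; a short Python sketch:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # all 36 (first, second) pairs

# P(first roll is 2 and second roll is 4)
p_ordered = Fraction(sum(1 for r in rolls if r == (2, 4)), 36)

# P(one roll is 2 and the other is 4, in either order)
p_unordered = Fraction(sum(1 for r in rolls if sorted(r) == [2, 4]), 36)

assert p_ordered == Fraction(1, 36)
assert p_unordered == Fraction(2, 36)
```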

Conditional probability

Conditional probabilities are like probabilities, but with extra information.

Example

Let \( A \) be the event that a die roll is at least \( 3 \), and let \( B \) be the event that the same roll is even. What is the probability of the roll being at least 3 if we are told it is even? That is, what is the probability of \( A \) given \( B \), or

\[ P(A\mid B)? \]

Conditional probability

Given that the roll is even, it has to be one of 2, 4, and 6.

Because every outcome is equally likely and two of those three are at least 3, intuition suggests

\[ P(A\mid B) = 2/3. \]

Intuition is usually not good at probability!

Conditional probability

The conditional probability can be calculated as

\[ P(A\mid B) = \frac{P(A\cap B)}{P(B)}. \]

In this case, intuition was right because

\[ \frac{P(A\cap B)}{P(B)} = \frac{P(X = 4 \cup X = 6)}{P(X = 2 \cup X = 4 \cup X = 6)} = \frac{2/6}{3 / 6} = 2/3 \]

Note: if \( P(B) = 0 \), then \( P(A\mid B) \) is not defined.
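The same answer falls out of enumerating the equally likely outcomes; a quick Python check:

```python
from fractions import Fraction

outcomes = range(1, 7)                    # one roll of a fair die
A = {x for x in outcomes if x >= 3}       # roll is at least 3
B = {x for x in outcomes if x % 2 == 0}   # roll is even

# P(A | B) = P(A and B) / P(B); with equally likely outcomes this
# reduces to |A and B| / |B|
p_cond = Fraction(len(A & B), len(B))
assert p_cond == Fraction(2, 3)
```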

Conditional probability

Much of applied research is concerned with conditional probabilities.

Example

If I randomly sample a patient to a study, what is the probability that they develop lung cancer given that they are a smoker?

If this conditional probability is significantly greater than the probability that they develop lung cancer given that they are not a smoker, then this can be an indication that smoking increases the risk of developing lung cancer.

Conditional probability

Recall that \( A \) and \( B \) are independent if \( P(A\cap B) = P(A)P(B) \).

Thus, if \( A \) and \( B \) are independent,

\[ P(A\mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A). \]

Knowing \( B \) gives you no information about how likely \( A \) is to occur.

The law of total probability

The probability that \( A \) happens is the probability that \( A \) and \( B \) happen, or that \( A \) and \( B^c \) happen.

Draw a Venn diagram to convince yourself that

\[ A = (A\cap B)\cup (A\cap B^c) \]

Because \( A\cap B \) and \( A\cap B^c \) are disjoint,

\[ P(A) = P(A\cap B) + P(A\cap B^c) = P(A\mid B)P(B) + P(A\mid B^c)P(B^c). \]

Bayes' theorem [!]

\[ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} \]

Example (from https://en.wikipedia.org/wiki/Bayes%27_theorem)

Let \( A \) be the event that a randomly sampled person is a cannabis user.

Let \( B \) be the event that the randomly sampled person tests positive for cannabis.

What is the probability that a randomly sampled person who tests positive for cannabis is a cannabis user?

Bayes' theorem [!]

Make the following assumptions:

  1. The test's true positive rate is 0.9, or \( P(B \mid A) = 0.9 \).

  2. The test's true negative rate is 0.95, or \( P(B^c \mid A^c) = 0.95 \)

  3. In the population, 5 % are cannabis users, so \( P(A) = 0.05 \).

Bayes' theorem [!]

We want to compute \( P(A\mid B) \), and Bayes' theorem tells us

\[ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} = \frac{0.9\times 0.05}{ P(B)}, \]

The law of total probability tells us \( P(B) = P(B\mid A)P(A) + P(B\mid A^c)P(A^c) \).

We know \( P(B\mid A)P(A) = 0.9 \times 0.05 \) and \( P(A^c) = 1 - P(A) = 0.95 \).

We can compute \( P(B\mid A^c) = 1 - P(B^c \mid A^c) = 0.05 \).

Bayes' theorem [!]

We get

\[ P(A\mid B) = \frac{0.9\times 0.05}{0.9\times 0.05 + 0.05 \times 0.95} = \frac{90}{90 + 95} \approx 0.49. \]

Even after having observed a positive test, it is more likely the person is not a user.

Intuition is often wrong about Bayes' theorem because intuition usually suggests that if you use an accurate test, then you can trust its result.
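The whole calculation takes a few lines of code; here is a Python sketch of the arithmetic above:

```python
p_A = 0.05                   # P(A): proportion of cannabis users
p_B_given_A = 0.90           # true positive rate
p_B_given_not_A = 1 - 0.95   # false positive rate = 1 - true negative rate

# Law of total probability for P(B)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 2))   # 0.49
```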

Summary of probability rules [!]

  1. \( 0\leq P(A) \leq 1 \)
  2. \( P(A \cup B) = P(A) + P(B) - P(A\cap B) \), which equals \( P(A) + P(B) \) if \( A \) and \( B \) are disjoint
  3. \( P(A) \geq P(B) \) if \( B \) is a subset of \( A \) (\( A \) contains \( B \))
  4. \( P(A\cap B) = P(A)P(B) \) if \( A \) and \( B \) are independent.
  5. \( P(A\mid B) = P(A\cap B) / P(B) \) if \( P(B) > 0 \).
  6. \( P(A\mid B) = P(B\mid A)P(A) / P(B) \) (Bayes' theorem, follows from 5.)

    Remember to draw a Venn diagram if you are unsure!

Probability

Exercise

Is it true, for any events \( A \) and \( B \), that \( P(A) \geq P(A\cap B) \) and \( P(A) \leq P(A \cup B) \)?

Why?

Can the inequalities be equalities for some specific choices of \( A \) and \( B \)?





Random variables

Random variables: content

  • Discrete random variables

  • Continuous random variables

  • Distributions

Random variables

Most events we will calculate probabilities for involve discrete or continuous random variables.

Recall that a random variable is a, typically numerical, measurement of the outcome of an experiment yet to be performed.

Discrete random variables

Discrete variables can take at most countably many values (countable support).

“Countably” has a mathematical definition, but it is quite literal: you can count the possible values.

There can be finitely or infinitely many possible values.

Example

The set \( \{1, 1/2, 1/3, 1/4, \dots\} \) is countable and infinite.

Random variables

Example

Suppose I flip a coin and if it comes up heads, I flip again. If it comes up tails, I stop.

Let \( X \) be the number of flips I will have made at the end of this experiment.

It is possible that I flip \( 1000 \) heads in a row, but highly improbable.

The same is true for any integer. Thus, \( X \) can take the values \( 1, 2, 3, \dots \) and is therefore a discrete random variable that can take infinitely many values.

Fun fact

With advanced probability theory one can show that the probability of continuing flipping forever is zero.

Random variables

Continuous random variables

Continuous variables can take an uncountable number of values (uncountable support). That is, you cannot count the possible values even if you keep counting forever.

In practice, every variable we record is discrete, because we cannot measure or store values with infinite precision.

Nevertheless, it is often a useful approximation to assume continuous variables.

Example

The set of (decimal) numbers between \( 0 \) and \( 1 \) is uncountable.

The time it takes for my daughter to tie her shoes in the morning.

Random variables

Distribution

The rule (law) telling us the probabilities that \( X \) takes certain values is called the distribution of \( X \).

Every random variable \( X \) has a cumulative distribution function (cdf).

Cumulative distribution function

The cdf of a random variable \( X \) is the function defined by \( F(x) = P(X \leq x) \).

You plug in \( x \), and the cdf tells you the probability that \( X \) is less than or equal to \( x \).

Random variables

It may not be obvious, but the cdf tells you everything there is to know about the distribution of \( X \).

We say that \( F \) characterizes the distribution of \( X \).

In theory, if you know \( F \), you can calculate any probabilities involving \( X \).

Discrete random variables

Discrete probability distributions also have a probability mass function (pmf).

Probability mass function

The pmf of a discrete \( X \) is the function defined by \( f(x) = P(X = x) \).

You plug in \( x \), and the pmf tells you the probability that \( X = x \).

Discrete random variables

Example

We say that \( X \) has a Bernoulli distribution with parameter \( 0 \leq p \leq 1 \), or \( X\sim \mathrm{Ber}(p) \), if

\[ P(X = 1) = p, \qquad P(X = 0) = 1 - p. \]

Discrete random variables

If \( X \) is the number of successes in \( n \) independent trials, each with success probability \( p \), then \( X \) has a binomial distribution with parameters \( n \) and \( p \), or \( X\sim \mathrm{Bin}(n, p) \).

Example

Suppose I flip \( n = 10 \) coins and let \( X \) be the number of heads I get, then \( X \) has a binomial distribution with parameters \( n = 10 \) and \( p = 1/2 \) (assuming the coin is fair).

The pmf of \( X \) tells us the probability that \( X = x \) for every possible \( x \).

Discrete random variables

The binomial has pmf \[ f(x) = \binom{n}{x} p^x(1 - p)^{n - x}, \quad x = 0, \dots, n. \] The binomial coefficient is the number of ways to select \( x \) from \( n \).

Luckily, R can compute this for us.
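The formula is also easy to evaluate directly; here is a Python sketch for the ten-coin example (R's dbinom does the same thing):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

print(binom_pmf(5, n, p))   # 0.24609375, the most likely number of heads
print(sum(pmf))             # 1.0: the probabilities sum to one
```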


Discrete variables

Example

Suppose I flip \( 10 \) coins and let \( X_i \) be one if the \( i \)th flip is heads, and 0 otherwise. Then \( \sum_{i = 1}^{10} X_i \) is the number of heads I get in \( 10 \) flips. This illustrates the following fact:

If \( X_1, \dots, X_n \) are independent \( \mathrm{Ber}(p) \), then \( X = \sum_{i = 1}^n X_i \sim \mathrm{Bin}(n, p) \).

Discrete variables

Example

Suppose 50 % of all penguins in the population are of the Adelie species.

Assuming penguins were sampled independently, what was the probability of obtaining the number of Adelie penguins in the penguins data?

table(penguins$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 
# The probability of 'x' successes in 'size' independent trials with success probability 'prob'
dbinom(x = 152, size = 152 + 68 + 124, prob = 0.5) # d for density
[1] 0.00420746

Discrete variables

Example

What was the probability of obtaining at least as many Adelie penguins as in the penguins data?

# Use the cdf
# 1 - P(X <= n_success - 1) = P(X > n_success - 1) = P(X >= n_success)
1 - pbinom(q = 152 - 1, size = 152 + 68 + 124, prob = 0.5)
[1] 0.9865395

Exercise

What was the probability of obtaining fewer Adelie penguins than in the penguins data?

Discrete variables

Exercise

Consider rolling three dice and let \( X \) be the number of 6s rolled. What is the distribution of \( X \)?

Discrete variables

Exercise

Consider rolling two dice and let \( X \) be the number of 1s rolled. Find the probability mass function for \( X \).

Answer: First note that \( X \) can take three values: \( 0, 1 \), or \( 2 \). Thus, we need to find \( f(x) = P(X = x) \) for \( x = 0, 1, 2 \). Let \( X_i \) be one if the \( i \)th roll is 1 and zero otherwise. The event that \( X = 0 \) is the same as \( X_1 = 0 \cap X_2 = 0 \). Assuming the rolls are independent, one of the rules of probabilities says

\[ P(X = 0) = P(X_1 = 0 \cap X_2 = 0) = P(X_1 = 0)P(X_2 = 0)= (5/6)(5/6) = 25/36. \]

The event \( X = 1 \) consists of the outcomes \( X_1 = 0 \cap X_2 = 1 \) and \( X_1 = 1 \cap X_2 = 0 \). That is,

\[ (X = 1) = (X_1 = 0 \cap X_2 = 1) \cup (X_1 = 1 \cap X_2 = 0) \]

Discrete variables

The events in the union are disjoint, so the rules of probabilities say

\[ P(X = 1) = P(X_1 = 0 \cap X_2 = 1) + P(X_1 = 1 \cap X_2 = 0), \]

which, assuming independence, is

\[ P(X_1 = 0)P(X_2 = 1) + P(X_1 = 1)P(X_2 = 0) = (5/6)(1/6) + (1/6)(5/6) = 10/36. \]

It remains to find \( P(X = 2) \). We can do a similar calculation, or note that

\[ (X = 2)^c = (X = 0) \cup (X = 1) \]

so

\[ P(X = 2) = 1- P(X \neq 2) = 1 - (25/36 + 10/36) = 1/36. \]
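As a check, the three probabilities can be recovered by enumerating all 36 equally likely pairs of rolls; a Python sketch:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))      # 36 equally likely pairs

def pmf(x):
    """P(X = x), where X is the number of 1s in two rolls."""
    hits = sum(1 for r in rolls if r.count(1) == x)
    return Fraction(hits, 36)

assert pmf(0) == Fraction(25, 36)
assert pmf(1) == Fraction(10, 36)
assert pmf(2) == Fraction(1, 36)
assert pmf(0) + pmf(1) + pmf(2) == 1    # the pmf sums to one
```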

Discrete variables

One of the most commonly used distributions in practice is the Poisson distribution.

Poisson distribution

A random variable \( X \) has a Poisson distribution with parameter \( \lambda > 0 \) if it has pmf

\[ f(x) = e^{-\lambda} \lambda^x / x!, \quad x = 0, 1, \dots, \] where \( x! = x (x - 1)(x - 2)\cdots 2 \cdot 1 \) and \( 0! = 1 \).

Discrete variables

x_vals <- 0:10; lambda <- 4; pmf <- dpois(x_vals, lambda)
plot(x_vals, pmf , type = "h"); points(x_vals, pmf)


Discrete variables

For a discrete \( X \), its mean (expected value) and variance are

\[ \mu = E(X) = \sum_{x}x f(x) = \sum_{x} x P(X = x) \]

\[ \sigma^2 = \mathrm{var}(X) = \sum_x(x - \mu)^2 f(x) = \sum_x(x - \mu)^2 P(X = x) \geq 0 \]

The sums are over all \( x \) such that \( P(X = x) > 0 \).

These are average or expected values of \( X \) and \( (X - \mu)^2 \).

The standard deviation of \( X \) is \( \sqrt{\mathrm{var}(X)} \).

Discrete variables

Intuition

If \( X \) has large mean, then if we observe many independent realizations they will be large on average.

If \( X \) has large variance, then if we observe many independent realizations they will be spread far apart.

Exercise [*]

Show (or remember) that

  1. \( E(X - c) = E(X) - c \)
  2. \( E(cX) = c E(X) \)
  3. \( \mathrm{var}(X - c) = \mathrm{var}(X) \)
  4. \( \mathrm{var}(cX) = c^2 \mathrm{var}(X) \).

Discrete variables

Example

The mean and variance of \( X\sim \mathrm{Ber}(p) \) are

\[ \mu = 1 \times P(X = 1) + 0 \times P(X = 0) = 1 \times p + 0 \times (1 - p) = p \]

\[ \sigma^2 = (1 - p)^2 \times P(X = 1) + (0 - p)^2 \times P(X = 0) = (1 - p)^2 p + p^2(1 - p) = p - p^2. \]
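The same sums can be computed mechanically; here is a Python sketch for the arbitrary choice \( p = 0.3 \):

```python
p = 0.3
support = [0, 1]
pmf = {0: 1 - p, 1: p}                    # Bernoulli pmf

mu = sum(x * pmf[x] for x in support)                  # E(X)
sigma2 = sum((x - mu) ** 2 * pmf[x] for x in support)  # var(X)

assert mu == p                            # E(X) = p
assert abs(sigma2 - (p - p**2)) < 1e-12   # var(X) = p - p^2 = p(1 - p)
```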

Discrete variables

One can show that if \( X \) is binomial with parameters \( n \) and \( p \) and \( Y \) is Poisson with parameter \( \lambda \), then

\[ E(X) = np,\quad \mathrm{var}(X) = n(p - p^2) \]

\[ E(Y) = \lambda,\quad \mathrm{var}(Y) = \lambda. \]

Summary discrete variables

  • A discrete random variable is one that has finite or infinite but countable support.
  • The distribution of a discrete random variable \( X \) is characterized by its pmf \( f(x) = P(X = x) \).
  • Common examples: uniform, Bernoulli, binomial, and Poisson (there are many others)
  • You can calculate mean and variance by sums

Continuous variables

Continuous variables have uncountable support.

The distributions of continuous random variables are characterized by a probability density function (pdf) \( f(x) \).

The area under the graph of a pdf over a region tells you the probability that \( X \) takes its value in that region.

[Figure: standard normal pdf with the area between \( -2 \) and \( 0 \) shaded]

Continuous variables

In general,

\[ P(a < X \leq b) = \int_a^b f(x)\, \mathrm{d} x = F(b) - F(a). \]

You will not have to integrate anything analytically in this class—we will use R!

In R, you can evaluate cdfs for common distributions.

pnorm(0) - pnorm(-2) # The area in the previous slide
[1] 0.4772499

Continuous variables

You should know:

  • The total area under a pdf is \( 1 \) (probability that \( X \) is between \( -\infty \) and \( \infty \))

  • The probability that \( X \) is between \( a \) and \( b \) is the probability that \( X \) is less than \( b \) minus the probability that \( X \) is less than \( a \), so we can compute it as \( P(a < X < b) = F(b) - F(a) \).

    Exercise

    Use basic rules of probabilities to explain why the second point is true.

Continuous variables

Why don't we characterize continuous distributions by a pmf \( f(x) = P(X = x) \)?

It is outside the scope of this class to prove, but you should know that, for a continuous \( X \),

\[ P(X = x) = 0 \quad \text{for every $x$!} \]

A continuous \( X \) has mean and variance

\[ \mu = E(X) = \int_{-\infty}^{\infty} x f(x)\, \mathrm{d}x \] and \[ \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, \mathrm{d}x. \]

Continuous variables

Common continuous distributions

Normal with mean \( \mu \) and variance \( \sigma^2 \): \( f(x; \mu, \sigma^2) = e^{-(x - \mu)^2 / (2\sigma^2)} / \sqrt{2\pi \sigma^2} \)

Uniform on \( [a, b] \): \( f(x) = 1 / (b - a) \) for \( a \leq x\leq b \)

Exponential with parameter \( \lambda > 0 \): \( f(x) = \lambda e^{-\lambda x} \) for \( x \geq 0 \)

[Figure: pdfs of the standard normal, uniform on \( [0, 1] \), and exponential with \( \lambda = 1 \)]

Continuous variables

The probability that each of them (standard normal, exponential with \( \lambda = 1 \), uniform on \( [0, 1] \)) is less than 0.9:

  pnorm(0.9)
[1] 0.8159399
  pexp(0.9)
[1] 0.5934303
  punif(0.9)
[1] 0.9

Exercise

What is the probability that they are greater than \( 0.8 \)? How do you calculate it in R?

Continuous variables

One can show that if \( X \) is exponential with parameter \( \lambda \) and \( Y \) is uniform on \( [a, b] \), then

\[ E(X) = 1 / \lambda,\quad \mathrm{var}(X) = 1 / \lambda^2 \]

\[ E(Y) = (a + b)/2,\quad \mathrm{var}(Y) = (b - a)^2 /12. \]
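These formulas can be sanity-checked by simulation; a Python sketch with arbitrarily chosen parameters (\( \lambda = 2 \), \( a = 0 \), \( b = 1 \)):

```python
import random

random.seed(1)                 # fixed seed so the check is reproducible
n = 100_000
lam, a, b = 2.0, 0.0, 1.0      # arbitrary parameters for the illustration

exp_draws = [random.expovariate(lam) for _ in range(n)]
unif_draws = [random.uniform(a, b) for _ in range(n)]

print(sum(exp_draws) / n)    # close to 1 / lam = 0.5
print(sum(unif_draws) / n)   # close to (a + b) / 2 = 0.5
```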

Connection to the histogram

Recall that a histogram tells you how many observations fall in each of a set of intervals.

  hist(penguins$flipper_length_mm)


If you make the intervals smaller and the sample larger, eventually the histogram should look like the pdf of a randomly selected penguin's flipper length.

Connection to the histogram

  x <- rnorm(5); y <- rnorm(10); z <- rnorm(50); w <- rnorm(5000)

[Figure: histograms of normal samples of sizes 5, 10, 50, and 5000]

Thus, if the sample is large enough, we should be able to say something about penguins in general, not just the ones in our sample; that is the goal of statistical inference!